DSE Information

Specifying What to Index

The DSE determines how the document's location must be specified and whether the document is converted. The content-type specified for a document determines whether the document is indexed.

A content collection makefile must specify the content-type attribute for a document as well as the content-type output by the DSE. If you specify a content-type that the server does not index, the NXT 4 server stores and retrieves the document as binary data.

The NXT 4 server indexes documents of the following content-types (MIME types):

text/html
text/xml
text/plain
application/pdf
application/msexcel
application/msword
application/mspowerpoint
application/x-html-body-text
application/vnd.oasis.opendocument.text


Note: The NXT 4 server uses the IFilter COM interface to extract terms from Microsoft Office and Adobe Acrobat PDF files.



Note: To index ODT documents, you must install the corresponding version of the OpenOffice IFilters, which are included in Apache OpenOffice 4.0 and higher or in similar software (for example, LibreOffice 4.3 and higher). You can download and install the required software manually from the official site.
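
The rule above (listed content-types are indexed, all others are stored as binary data) can be modeled with a short sketch. This is an illustration only, not part of the NXT 4 server or its API. The function name is_indexed is hypothetical, and the normalization of case and parameters is an assumption of the sketch rather than documented server behavior.

```python
# Content-types that this document lists as indexed by the NXT 4 server.
INDEXED_CONTENT_TYPES = frozenset({
    "text/html",
    "text/xml",
    "text/plain",
    "application/pdf",
    "application/msexcel",
    "application/msword",
    "application/mspowerpoint",
    "application/x-html-body-text",
    "application/vnd.oasis.opendocument.text",
})


def is_indexed(content_type: str) -> bool:
    """Return True if content_type appears in the indexed list above.

    Assumption of this sketch: parameters such as "; charset=utf-8" and
    letter case are ignored before the lookup.
    """
    base = content_type.split(";", 1)[0].strip().lower()
    return base in INDEXED_CONTENT_TYPES


if __name__ == "__main__":
    for candidate in ("text/html", "image/png"):
        outcome = "indexed" if is_indexed(candidate) else "stored as binary data"
        print(f"{candidate}: {outcome}")
```

A check like this can be run over the content-types in a makefile before a build to spot documents that will be stored but not indexed.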


Changing a Document's Native Format

The NXT 4 server stores documents in their native format, which avoids the translation mistakes that often occur when data is converted between formats during storage and retrieval. However, storing a document in its native format also requires that the browser have a plug-in to display that format.

The NXT 4 server does not alter a document when storing or retrieving it, but you can use a DSE to modify the format of a document before it is stored in a content collection.

Defining an Element for a DSE

Before you can use a DSE, you must define a dse element for it. The class-id attribute specifies the COM class ID of the DSE to use. The class IDs of the NXT DSEs are specified in makefile.dtd.

In addition to the dse elements defined in makefile.dtd, you can define other dse elements that use these DSEs. Each definition must have a unique id (name). This allows you to specify different parameters for the DSEs in each dse element definition.

Specifying a DSE

After you define a dse element, you can use the DSE by specifying its ID in the dse attribute of a document element. You may want to use certain options for one set of files and different options for another set of files. To do this, define two dse elements, each specifying different options. For each document, specify the ID of the dse element whose options you want applied when the document is converted.

If you do not specify a dse attribute for a document, ccBuild uses the default DSE defined by default-dse-id in makefile.dtd.
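
The selection rule described above can be sketched as a lookup with a fallback. This is a conceptual model, not ccBuild's implementation. The names DseDefinition and resolve_dse, the parameter keys, and the placeholder class ID are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class DseDefinition:
    """Hypothetical in-memory model of one dse element definition."""
    dse_id: str      # the unique id (name) of the definition
    class_id: str    # COM class ID of the DSE (placeholder values below)
    params: Dict[str, str] = field(default_factory=dict)


def resolve_dse(document_dse: Optional[str],
                definitions: Dict[str, DseDefinition],
                default_dse_id: str) -> DseDefinition:
    """Pick the definition named by the document, else the default one."""
    dse_id = document_dse if document_dse is not None else default_dse_id
    if dse_id not in definitions:
        raise ValueError(f"no dse element is defined with id {dse_id!r}")
    return definitions[dse_id]


if __name__ == "__main__":
    # Two definitions share one DSE class but carry different options.
    definitions = {
        "plain": DseDefinition("plain", "{placeholder-class-id}"),
        "tuned": DseDefinition("tuned", "{placeholder-class-id}",
                               {"option": "value"}),
    }
    print(resolve_dse("tuned", definitions, "plain").params)
    print(resolve_dse(None, definitions, "plain").dse_id)
```

Keeping several definitions keyed by id mirrors the point above: one DSE class can be reused with different parameters for different sets of files.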

Chaining DSEs

The DSE architecture supports chaining. This means that one DSE takes as its input the output provided by another DSE. When defining a dse element, use the chain attribute to specify another DSE it should chain to.

NXT 4 supports both Unicode and ANSI DSEs. Because of the differences in the DSE interfaces, chaining between the two interfaces is not recommended and in most situations will not work. A DSE can be written to handle the string translations itself, which would allow chaining between Unicode and ANSI DSEs, but this does not occur automatically.

Note: The FSysDSE supplied with NXT 4 does not support chaining to other DSEs.
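
The chaining rules above can be sketched as a walk along the chain attributes. The sketch assumes that a dse element consumes the output of the DSE named in its chain attribute, so the DSE with no chain attribute runs first. That direction is an assumption here, and DseElement, chain_order, and mixes_interfaces are hypothetical names, not part of NXT 4.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DseElement:
    """Hypothetical model of a dse element for reasoning about chains."""
    dse_id: str
    interface: str               # "unicode" or "ansi"
    chain: Optional[str] = None  # id of the DSE this one chains to


def chain_order(dse_id: str, elements: Dict[str, DseElement]) -> List[str]:
    """Return DSE ids in execution order, starting with the source DSE."""
    order: List[str] = []
    current: Optional[str] = dse_id
    while current is not None:
        if current in order:
            raise ValueError(f"chain loops back to {current!r}")
        if current not in elements:
            raise ValueError(f"no dse element is defined with id {current!r}")
        order.append(current)
        current = elements[current].chain
    order.reverse()  # assumed direction: the chained-to DSE runs first
    return order


def mixes_interfaces(order: List[str], elements: Dict[str, DseElement]) -> bool:
    """Return True if the chain combines Unicode and ANSI DSEs."""
    return len({elements[dse_id].interface for dse_id in order}) > 1


if __name__ == "__main__":
    elements = {
        "source": DseElement("source", "unicode"),
        "cleanup": DseElement("cleanup", "unicode", chain="source"),
    }
    order = chain_order("cleanup", elements)
    print(order, "mixed interfaces:", mixes_interfaces(order, elements))
```

Running a check like mixes_interfaces over a planned chain flags the Unicode and ANSI combination that the text above advises against.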

DSE Processing Sequence

The DSE is involved in the following sequence of events when building a content collection or update file:

  1. You specify the documents from which to build a content collection in a makefile passed to ccBuild. The makefile indicates which DSE to use to retrieve a document.
  2. ccBuild loads the specified DSE, and the DSE receives the request to retrieve a document.
  3. ccBuild requests version information for the document. If the version is the same as that stored in the content collection, ccBuild skips to the next document. Otherwise, it goes to Step 4.
  4. The DSE retrieves the source document and delivers it to ccBuild.
  5. ccBuild stores the document in the content collection at the location specified in the makefile.

Optionally, a DSE can preprocess the document by adding, changing, and removing data or possibly even changing its format. The content-type specified for the document should be the final content-type after the DSE finishes.
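
The sequence above can be sketched as a loop that skips unchanged documents. This models the logic only and is not ccBuild code. SourceDocument, build, and the dictionary standing in for a content collection are hypothetical, and for brevity one location string serves as both the source location and the location in the collection.

```python
from typing import Callable, Dict, List, NamedTuple


class SourceDocument(NamedTuple):
    """Hypothetical stand-in for one document entry in a makefile."""
    location: str                # where the document is stored
    version: str                 # version information reported for it
    fetch: Callable[[], bytes]   # stands in for the DSE retrieving it


def build(documents: List[SourceDocument],
          stored_versions: Dict[str, str],
          collection: Dict[str, bytes]) -> List[str]:
    """Store new or changed documents and return the locations written."""
    written: List[str] = []
    for doc in documents:
        # Step 3: skip the document if its version matches the stored one.
        if stored_versions.get(doc.location) == doc.version:
            continue
        # Step 4: the DSE retrieves the source document.
        data = doc.fetch()
        # Step 5: store it at the location given in the makefile.
        collection[doc.location] = data
        stored_versions[doc.location] = doc.version
        written.append(doc.location)
    return written


if __name__ == "__main__":
    docs = [SourceDocument("guide/intro.html", "v1", lambda: b"<p>Intro</p>")]
    versions: Dict[str, str] = {}
    collection: Dict[str, bytes] = {}
    print(build(docs, versions, collection))  # first build stores the document
    print(build(docs, versions, collection))  # second build finds it unchanged
```

The version comparison is what lets a rebuild or update file skip documents that have not changed since the last build.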

File System DSE

The File System DSE (FSysDSE) imports documents using the file services provided by the operating system. FSysDSE reads files from disk and passes them to ccBuild or another DSE. The most common use of FSysDSE is to import graphics, HTML, Word, Excel, PowerPoint, PDF, and XML documents.

Note: The NXT 4 server supports importing XML documents. For browsers that do not support the display of XML, a display filter provides a server-side translation of XML to HTML or DHTML.

When using FSysDSE to store a document, a content collection makefile should specify the content-type corresponding to the source document's data format. Typically, that content-type also corresponds to the application or plug-in that will handle the document once it reaches the browser.

FSysDSE stores the document in its native form. The indexer uses internal filters to extract the text from document types such as PDF, Word, Excel, PowerPoint, WordPerfect, HTML, plain text, and XML. The extracted text is used to index the documents, but the original document is stored in its native format.

Creating New DSEs

In addition to the FSysDSE shipped with NXT 4, you can create other DSEs using the DSE API. For more information, see the documentation that is included with NXT 4 Builder (../Rocket/NXT 4/Builder/dseapi/dseapi.nxt). By default, this collection is not installed. To access it, use Content Network Manager to mount the collection on your default site or wherever you want it to be accessible.